Building Decision Tree Software Quality Classification Models Using Genetic Programming
Abstract
Predicting the quality of software modules prior to testing or system operations allows a focused software quality improvement endeavor. Decision trees are very attractive for classification problems because of their comprehensibility and white-box modeling features. However, optimizing both the classification accuracy and the tree size is a difficult problem, and to our knowledge very few studies have addressed the issue. This paper presents an automated and simplified genetic programming (GP) based decision tree modeling technique for calibrating software quality classification models. The proposed technique is based on multiobjective optimization using strongly typed GP. Two fitness functions are used to optimize the classification accuracy and tree size of the classification models calibrated for a real-world high-assurance software system. The performances of the classification models are compared with those obtained by standard GP. It is shown that the GP-based decision tree technique yielded better classification models.
A timely quality prediction of the software system can be useful in improving the overall reliability and quality of the software product. Software quality classification (SQC) models, which classify program modules as either fault-prone (fp) or not fault-prone (nfp), can be used to target software quality improvement resources toward modules that are of low quality. With the aid of such models, software project managers can deliver a high-quality software product on time and within the allocated budget. Usually, software quality estimation models are based on software product and process measurements, which have been shown to be effective indicators of software quality. Therefore, a given SQC technique aims to model the underlying relationship between the available software metrics data (the independent variables) and the software quality factor (the dependent variable).
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1808–1809, 2003. © Springer-Verlag Berlin Heidelberg 2003
Among the commonly used SQC models, the decision tree (DT) based modeling approach is very attractive due to the model’s white-box feature. A DT-based SQC model is usually a binary tree with two types of nodes: query nodes and leaf nodes. A query node is a logical expression that returns either true or false, whereas a leaf node is a terminal that assigns a class label. Each query node in the tree can be seen as a classifier that partitions the data set into subsets. All the leaf nodes are labelled as one of the class types, such as fp or nfp. An observation or case in the given data set will “trickle” down, according to its attribute values, from the root node of the decision tree to one of its leaf nodes.
This study focuses on calibrating GP-based decision tree models for predicting the quality of software modules as either fp or nfp. In the context of what constitutes a good DT-based model, two factors are usually considered: classification accuracy and model simplicity. Classification accuracy is often measured in terms of the misclassification error rates, whereas the simplicity of a decision tree is often expressed in terms of its number of nodes. Therefore, an optimal DT is one that has low misclassification error rates and relatively few nodes. GP is a logical choice for problems that require multiobjective optimization, primarily because it is based on the process of natural evolution, which involves the simultaneous optimization of several factors. Very few studies have investigated GP-based decision tree models, and among those none has investigated GP-based decision trees for the SQC problem.
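The query-node and leaf-node structure described above can be sketched as a small data structure. This is only an illustrative sketch: the metric names (`loc`, `cyclomatic`), the thresholds, and the choice of a `metric <= threshold` test are assumptions made for the example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Leaf node: a terminal that assigns a class label ("fp" or "nfp")."""
    label: str


@dataclass
class Query:
    """Query node: a logical test on one software metric (assumed form)."""
    metric: str       # hypothetical metric name, e.g. "loc"
    threshold: float  # hypothetical cut-off value
    if_true: "Node"   # subtree followed when metric <= threshold
    if_false: "Node"  # subtree followed otherwise


Node = Union[Leaf, Query]


def classify(node: Node, module: Dict[str, float]) -> str:
    """Let one module trickle down from the root to a leaf."""
    while isinstance(node, Query):
        if module[node.metric] <= node.threshold:
            node = node.if_true
        else:
            node = node.if_false
    return node.label


def tree_size(node: Node) -> int:
    """Count query and leaf nodes, the simplicity measure used in the text."""
    if isinstance(node, Leaf):
        return 1
    return 1 + tree_size(node.if_true) + tree_size(node.if_false)


# Illustrative tree; metric names and thresholds are made up.
tree = Query("loc", 200.0,
             Leaf("nfp"),
             Query("cyclomatic", 10.0, Leaf("nfp"), Leaf("fp")))

print(classify(tree, {"loc": 350.0, "cyclomatic": 14.0}))  # fp
print(tree_size(tree))  # 5
```

Each `Query` partitions the modules that reach it into two subsets, and `tree_size` gives the node count that the simplicity objective refers to.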
Previous works related to GP-based classification models have focused on the standard GP process, which requires that the function and terminal sets have the closure property. This property implies that all the functions in the function set must accept any of the values and data types defined in the terminal set as arguments. However, since each decision tree has at least two different types of nodes, i.e., query nodes and leaf nodes, the closure property requirement of standard GP does not guarantee the generation of valid individuals. Strongly Typed Genetic Programming (STGP) has been used to alleviate the closure property requirement (Montana [1]) by allowing each function to define the data types it can accept. Moreover, each function, variable, and terminal is assigned a specific type in STGP. When the STGP process generates an individual or performs a genetic operation during the simulated evolution process, it considers additional criteria that are not part of standard GP. For example, in our study of calibrating SQC models, a function in the function set can only appear in query nodes, while terminals such as fp and nfp can only appear in leaf nodes. Therefore, in our study we use STGP to build decision trees.
In this study we investigated a simplified GP-based multiobjective optimization method for automatically building optimal SQC decision trees that have a high classification accuracy rate and a relatively small tree size. The GP-based decision trees were built to optimize two fitness functions: the average weighted cost of misclassification, and the tree size, which is expressed in terms of the number of tree nodes. Moreover, the relative classification performances of the GP-based DT models and those of standard GP were evaluated in the context of a real-world industrial software project. It was observed that the proposed approach yielded useful decision trees.
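The typing constraint and the two fitness functions can be illustrated with a short sketch. Everything specific here is an assumption made for illustration, not the paper's method: the cost formula (per-error costs `cost_i` and `cost_ii` averaged over all modules), the cost values, the 0.3 leaf probability, the threshold range, and the lexicographic comparison of the two objectives.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # "fp" or "nfp"


@dataclass
class Query:
    metric: str
    threshold: float
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Query]
Module = Dict[str, float]


def random_tree(rng: random.Random, metrics: List[str], depth: int) -> Node:
    """Typed generation: class labels only fill leaf slots and metric tests
    only fill query slots, so every generated tree is a valid classifier."""
    if depth == 0 or rng.random() < 0.3:  # 0.3 is an arbitrary choice
        return Leaf(rng.choice(["fp", "nfp"]))
    return Query(rng.choice(metrics), rng.uniform(0.0, 100.0),
                 random_tree(rng, metrics, depth - 1),
                 random_tree(rng, metrics, depth - 1))


def classify(node: Node, module: Module) -> str:
    while isinstance(node, Query):
        node = node.if_true if module[node.metric] <= node.threshold else node.if_false
    return node.label


def tree_size(node: Node) -> int:
    if isinstance(node, Leaf):
        return 1
    return 1 + tree_size(node.if_true) + tree_size(node.if_false)


def fitness(tree: Node, data: List[Tuple[Module, str]],
            cost_i: float = 1.0, cost_ii: float = 10.0) -> Tuple[float, int]:
    """Return (average weighted misclassification cost, tree size).
    Assumed formula: (cost_i * n_i + cost_ii * n_ii) / number of modules,
    where n_i counts nfp modules labelled fp and n_ii counts fp modules
    labelled nfp. Lower is better for both objectives."""
    n_i = n_ii = 0
    for module, actual in data:
        predicted = classify(tree, module)
        if actual == "nfp" and predicted == "fp":
            n_i += 1
        elif actual == "fp" and predicted == "nfp":
            n_ii += 1
    return ((cost_i * n_i + cost_ii * n_ii) / len(data), tree_size(tree))


# Toy data set with a single made-up metric.
data = [({"loc": 20.0}, "nfp"), ({"loc": 80.0}, "fp"),
        ({"loc": 90.0}, "fp"), ({"loc": 10.0}, "nfp")]

rng = random.Random(0)
population = [random_tree(rng, ["loc"], depth=3) for _ in range(50)]
# Tuple comparison ranks by cost first and size second (an assumed scheme).
best = min(population, key=lambda t: fitness(t, data))
print(fitness(best, data))
```

Under standard GP with the closure property, a class label could end up where a test is expected. Restricting each slot to one node type rules that out at generation time.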
Similar resources
Dimensionality Reduction and Improving the Performance of Automatic Modulation Classification using Genetic Programming (RESEARCH NOTE)
This paper shows how we can take advantage of genetic programming in the selection of suitable features for automatic modulation recognition. Automatic modulation recognition is one of the essential components of modern receivers. In this regard, selection of suitable features may significantly affect the performance of the process. Simulations were conducted with 5 dB and 10 dB SNRs. Test and ...
Building Ecological Models Using Genetic Programming
This paper reports on preliminary research using genetic programming (GP) to model the spatial distribution of an endangered Australian marsupial, the southern brown bandicoot (Isoodon obesulus). GP is compared in this work with classical decision tree learning. The models built by GP are surprisingly robust, generalising significantly better than those built by decision trees.
The Use of Genetic Algorithm, Clustering and Feature Selection Techniques in Construction of Decision Tree Models for Credit Scoring
Decision tree modelling, one of the data mining techniques, is used for credit scoring of bank customers. The main problem is the construction of decision trees that can classify customers optimally. This study presents a new hybrid mining approach in the design of an effective and appropriate credit scoring model. It is based on genetic algorithm for credit scoring of bank customers in order ...
Comparison of Decision Tree and Naïve Bayes Methods in Classification of Researcher’s Cognitive Styles in Academic Environment
In today's world of the internet, it is important to give users feedback based on what they demand. Moreover, one of the important tasks in data mining is classification. Today, there are several classification techniques for solving classification problems, such as Genetic Algorithm, Decision Tree, Bayesian and others. In this article, it is attempted to classify researchers to “Expert” and “No...
Publication date: 2003